 Sendai


SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

arXiv.org Machine Learning

We present an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.
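
The exact-lookup step that suffix-array search relies on can be illustrated with a small in-memory sketch. This is not the authors' implementation: the naive construction, the helper names `build_suffix_array` and `find_occurrences`, and the toy corpus are assumptions made for illustration, and the disk-aware layout, soft (semantic) matching, and pruning described in the abstract are not modeled.

```python
def build_suffix_array(tokens):
    """Return suffix start positions sorted lexicographically (naive, for small inputs)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, sa, query):
    """Binary-search the suffix array for every position where `query` occurs."""
    m = len(query)

    # Lower bound: first suffix whose first m tokens are >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m tokens are > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with the query.
    return sorted(sa[start:lo])


# Hypothetical toy corpus, tokenized on whitespace.
corpus = "the cat sat on the mat and the cat ran".split()
sa = build_suffix_array(corpus)
hits = find_occurrences(corpus, sa, ["the", "cat"])
misses = find_occurrences(corpus, sa, ["dog"])
```

Each lookup needs only two binary searches over the sorted suffixes, so the number of comparisons grows logarithmically with corpus length, which is the property that lets this family of methods scale to very large corpora.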



PARD: Permutation-invariant Autoregressive Diffusion for Graph Generation

Neural Information Processing Systems

Specifically, we show that contrary to sets, elements in a graph are not entirely unordered and there is a unique partial order for nodes and edges. With this partial order, PARD generates a graph in a block-by-block, autoregressive fashion, where each block's probability is conditionally modeled by a shared diffusion model with an equivariant network.


Data-Driven Global Sensitivity Analysis for Engineering Design Based on Individual Conditional Expectations

arXiv.org Machine Learning

Explainable machine learning techniques have gained increasing attention in engineering applications, especially in aerospace design and analysis, where understanding how input variables influence data-driven models is essential. Partial Dependence Plots (PDPs) are widely used for interpreting black-box models by showing the average effect of an input variable on the prediction. However, their global sensitivity metric can be misleading when strong interactions are present, as averaging tends to obscure interaction effects. To address this limitation, we propose a global sensitivity metric based on Individual Conditional Expectation (ICE) curves. The method computes the expected feature importance across ICE curves, along with their standard deviation, to more effectively capture the influence of interactions. We provide a mathematical proof demonstrating that the PDP-based sensitivity is a lower bound of the proposed ICE-based metric under truncated orthogonal polynomial expansion. In addition, we introduce an ICE-based correlation value to quantify how interactions modify the relationship between inputs and the output. Comparative evaluations were performed on three cases: a 5-variable analytical function, a 5-variable wind-turbine fatigue problem, and a 9-variable airfoil aerodynamics case, where ICE-based sensitivity was benchmarked against PDP, SHapley Additive exPlanations (SHAP), and Sobol' indices. The results show that ICE-based feature importance provides richer insights than the traditional PDP-based approach, while visual interpretations from PDP, ICE, and SHAP complement one another by offering multiple perspectives.
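
The way averaging can hide an interaction is easy to reproduce on a toy function. The sketch below is an illustration under stated assumptions, not the paper's metric: it uses the standard deviation of a curve over a grid as the importance score, a hypothetical model `f(x1, x2) = x1 * x2`, and hand-picked sample values for `x2`.

```python
from statistics import mean, pstdev


def model(x1, x2):
    """Hypothetical black-box model with a pure interaction term."""
    return x1 * x2


# Grid over the feature of interest (x1) and observed values of the other feature (x2).
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
x2_samples = [-1.0, -0.5, 0.5, 1.0]

# One ICE curve per sample: vary x1 over the grid while holding x2 fixed.
ice_curves = [[model(g, x2) for g in grid] for x2 in x2_samples]

# The PDP is the pointwise average of the ICE curves.
pdp_curve = [mean(column) for column in zip(*ice_curves)]

# Assumed importance score: standard deviation of a curve over the grid.
pdp_importance = pstdev(pdp_curve)
ice_importances = [pstdev(curve) for curve in ice_curves]
ice_importance = mean(ice_importances)   # expected importance across ICE curves
ice_spread = pstdev(ice_importances)     # variation of that importance across curves
```

Here the ICE curves have slopes of opposite sign that cancel in the average, so the PDP is flat and its importance is zero, while the mean ICE importance is clearly positive. With this score the PDP value can never exceed the ICE value, because the standard deviation of an averaged curve is at most the average of the individual standard deviations.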


Storage capacity of perceptron with variable selection

arXiv.org Machine Learning

A central challenge in machine learning is to distinguish genuine structure from chance correlations in high-dimensional data. In this work, we address this issue for the perceptron, a foundational model of neural computation. Specifically, we investigate the relationship between the pattern load $α$ and the variable selection ratio $ρ$ for which a simple perceptron can perfectly classify $P = αN$ random patterns by optimally selecting $M = ρN$ variables out of $N$ variables. While the Cover--Gardner theory establishes that a random subset of $ρN$ dimensions can separate $αN$ random patterns if and only if $α< 2ρ$, we demonstrate that optimal variable selection can surpass this bound by developing a method, based on the replica method from statistical mechanics, for enumerating the combinations of variables that enable perfect pattern classification. This not only provides a quantitative criterion for distinguishing true structure in the data from spurious regularities, but also yields the storage capacity of associative memory models with sparse asymmetric couplings.
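
The $α < 2ρ$ threshold for a random subset can be checked numerically with Cover's function-counting theorem, which gives the number of dichotomies of $P$ points in general position in $M$ dimensions that a hyperplane through the origin can realize as $2\sum_{k=0}^{M-1}\binom{P-1}{k}$. The sketch below covers only this random-subset baseline, not the replica calculation for optimal selection; the function name and the values $N = 100$, $ρ = 0.5$ are assumptions chosen for illustration.

```python
from math import comb


def separable_fraction(num_patterns, num_dims):
    """Fraction of the 2**P labelings of P points in general position in
    num_dims dimensions that are linearly separable (Cover's counting theorem)."""
    if num_patterns == 0:
        return 1.0
    count = 2 * sum(comb(num_patterns - 1, k) for k in range(num_dims))
    return count / 2 ** num_patterns


# Hypothetical sizes: N variables, of which M = rho * N are kept at random.
N = 100
rho = 0.5
M = int(rho * N)

below = separable_fraction(int(0.8 * N), M)      # alpha = 0.8 < 2 * rho
at_threshold = separable_fraction(2 * M, M)      # alpha = 2 * rho
above = separable_fraction(int(1.2 * N), M)      # alpha = 1.2 > 2 * rho
```

At $P = 2M$ the separable fraction is exactly one half, and it moves toward 1 below the threshold and toward 0 above it, with the transition sharpening as $N$ grows.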


Portuguese Man O'War species honors 'One-Eyed Dragon' samurai

Popular Science

The newly discovered P. mikazuki is a tribute to the famous warrior Date Masamune. A team of university students in Japan identified an entirely new species of the mighty Portuguese Man O'War. Described in a study recently published in the journal, the creature's distinct features and fearsome venom have earned it a name that honors a famous 16th-century samurai warrior. It's easy to mistake the Portuguese Man O'War for a jellyfish.


Yoshihiro Murai clinches sixth term as Miyagi governor

The Japan Times

Yoshihiro Murai, 65, celebrates his victory in the Miyagi gubernatorial election on Sunday night. SENDAI - Yoshihiro Murai held off four other candidates to clinch his sixth term as governor of Miyagi Prefecture in Sunday's gubernatorial election. Murai, an independent candidate who had support from prefectural assembly members of the Liberal Democratic Party, Japan Innovation Party and Komeito, highlighted his achievements as the prefecture's governor spanning five terms, or 20 years. The 65-year-old former chief of the National Governors' Association pledged to enhance productivity by promoting digital transformation using generative artificial intelligence, in anticipation of a further population decline. He successfully fended off Masamune Wada, 51, also an independent candidate, who had been closing in.


Man held in Japan on suspicion of creating female celeb deepfakes made with AI

The Japan Times

Tokyo police have arrested a 31-year-old man for allegedly creating fake sexual images of female celebrities with generative artificial intelligence technology and displaying them online, it was learned Thursday. It is the first time that police in Japan have cracked down on sexual deepfake images of celebrities created with generative AI. The suspect, Hiroya Yokoi of the city of Akita, has admitted he began making deepfakes to earn a small amount of money, which he used to cover living expenses and repay a student loan. Authorities believe Yokoi made a total of about 20,000 sexually explicit images of 262 women, such as actors, television personalities and idols, and amassed sales of ¥1.2 million between October last year and September this year.


Japan's government asks OpenAI to seek permission amid Sora 2 copyright concerns

The Japan Times
